(54) Method for analyzing the performance of application programs



(57) Methods, systems, and articles of manufacture
consistent with the present invention collect and display
performance data associated with executed programs.
A system consistent with an implementation of
the present invention collects performance analysis
information from various hardware and software components
of an instrumented program, and displays the
performance data in a multi-dimensional format.
Description

Field of the Invention 

[0001] The present invention relates generally to
performance analysis and more specifically to methods 
for providing a multidimensional view of performance 
data associated with an application program. 

Background

[0002] Multi-threading is the partitioning of an appli- 
cation program into logically independent "threads" of 
control that can execute in parallel. Each thread 
includes a sequence of instructions and data used by
the instructions to carry out a particular program task, 
such as a computation or input/output function. When 
employing a data processing system with multiple proc- 
essors, i.e., a multiprocessor computer system, each 
processor executes one or more threads depending
upon the number of processors to achieve multi- 
processing of the program. 

[0003] A program can be multi-threaded and still 
not achieve multi-processing if a single processor is 
used to execute all threads. While a single processor
can execute instructions of only one thread at a time,
the processor can execute multiple threads in parallel
by, for example, executing instructions corresponding to
one thread until reaching a selected instruction, suspending
execution of that thread, and executing instructions
corresponding to another thread, until all threads
have completed. In this scheme, as long as the processor
has started executing instructions for more than one
thread during a given time interval, all executing threads
are said to be "running" during that time interval.
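
By way of illustration only, the interleaving scheme described above might be sketched in Python as follows. The names `thread_body` and `run_round_robin` and the fixed `quantum` of instructions per turn are assumptions made for this sketch; the text itself says only that a thread runs until a selected point and is then suspended.

```python
from collections import deque

def thread_body(name, steps):
    """A hypothetical thread: yields after each instruction so it can be suspended."""
    for i in range(steps):
        yield f"{name}:{i}"

def run_round_robin(threads, quantum):
    """Run each thread for up to `quantum` instructions, then suspend it and switch."""
    ready = deque(threads)
    trace = []
    while ready:
        current = ready.popleft()
        for _ in range(quantum):
            try:
                trace.append(next(current))
            except StopIteration:
                break                   # thread completed; do not requeue it
        else:
            ready.append(current)       # quantum used up: suspend and requeue
    return trace

trace = run_round_robin([thread_body("A", 3), thread_body("B", 2)], quantum=2)
print(trace)
```

Both hypothetical threads have started before either has finished, so both would be regarded as "running" during the interval even though only one instruction executes at a time.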
[0004] Multiprocessor computer systems are typi- 
cally used for executing application programs intended 
to address complex computational problems in which 
different aspects of a problem can be solved using portions
of a program executing in parallel on different processors.
A goal associated with using such systems to
execute programs is to achieve a high level of performance,
in particular, a level of performance that reduces
the waste of the computing resources. Computer
resources may be wasted, for example, if processors
are idle (i.e., not executing a program instruction) for
any length of time. Such a wait cycle may be the result
of one processor executing an instruction that requires
the result of a set of instructions being executed by
another processor.
[0005] It is thus necessary to analyze performance 
of programs executing on such data processing sys- 
tems to determine whether optimal performance is 
being achieved. If not, areas for improvement should be 
identified.
[0006] Performance analysis in this regard gener- 
ally requires gathering information in three areas. The 
first considers the processor's state at a given time during
program execution. A processor's state refers to the
portion of a program (for example, set of instructions 
such as a subprogram, loop, or other code block) that 
the processor is executing during a particular time inter- 
val. The second considers how much time a processor 
spends in transition from one state to another. The third 
considers how close a processor is to executing at its 
peak performance. These three areas do not provide a 
complete analysis, however. They fail to address a 
fourth component of performance analysis, namely, pre- 
cisely what a processor did during a particular state 
(e.g., computation, input data, output data, etc.). 
[0007] When considering what a processor did 
while in a particular state, a performance analysis tool 
can determine the effect of operations within a state on
the performance level. Once these factors are identified, 
it is possible to synchronize operations that have a sig- 
nificant impact on performance with operations that 
have a less significant impact, and achieve a better 
overall performance level. For example, a first thread 
may perform an operation that uses significant 
resources while another thread scheduled to perform a 
separate operation in parallel with the first thread sits 
idle until the first thread completes its operation. It may
be desirable to cause the second thread to perform a 
different operation that does not require the first thread 
to complete its operation, thus eliminating the idle 
period for the second thread. By changing the second 
thread's schedule in this way the operations performed 
by both threads are better synchronized. 
[0008] When a performance analysis tool reports a 
problem occurring in a particular state, but fails to relate 
the problem to other events occurring in an application 
(for example, operations of another state), the informa- 
tion reported is relatively meaningless. To be useful a 
performance analysis tool must assist a developer in 
determining how performance information relates to a 
program's execution. Allowing a developer to
determine the context in which a performance problem
occurs therefore provides insight into diagnosing the problem.
[0009] The process of gathering this information for 
performance analysis is referred to as "instrumenta- 
tion." Instrumentation generally requires adding instruc- 
tions to a program under examination so that when the 
program is executed the instructions generate data from 
which the performance information can be derived. 
[0010] Current performance analysis tools gather 
data in one of two ways: subprogram level instrumenta- 
tion and bucket level instrumentation. A subprogram 
level instrumentation method of performance analysis 
tracks the number of subprogram calls by instrumenting 
each subprogram with a set of instructions that gener- 
ate data reflecting calls to the subprogram. It does not 
allow a developer to track performance data associated 
with the operations performed by each subprogram or a 
specified portion of the subprogram, for example, by 
specifying data collection beginning and ending points 
within a subprogram. 
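
By way of illustration only, subprogram level instrumentation of the kind described above might be sketched in Python as follows. The names `count_calls` and `call_counts` are hypothetical and are not taken from any existing tool. The sketch records only the number of calls to each subprogram, which is the limitation noted above.

```python
import functools
from collections import Counter

# Hypothetical table: subprogram name -> number of calls observed.
call_counts = Counter()

def count_calls(func):
    """Instrument a subprogram so that each call increments its counter."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        call_counts[func.__name__] += 1
        return func(*args, **kwargs)
    return wrapper

@count_calls
def compute(n):
    return sum(i * i for i in range(n))

@count_calls
def write_output(value):
    return str(value)

for n in (10, 20, 30):
    write_output(compute(n))

print(dict(call_counts))
```

Nothing in this sketch reveals what happened inside `compute` or between two points within it; only the call totals are available.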
[0011] A bucket level instrumentation performance
analysis tool divides the executable code into evenly 
spaced groups, or buckets. Performance data tracks the 
number of times a program counter was in a particular 
bucket at the conclusion of a specified time interval. 
This method of gathering performance data essentially 
takes a snapshot of the program counter at the speci- 
fied time interval. This method fails to provide compre- 
hensive performance information because it only 
collects data related to a particular bucket during the 
specified time interval. 
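
By way of illustration only, the bucket level method might be sketched in Python as follows. The function `bucket_histogram`, the sample addresses, the code start address, and the bucket size are all assumptions made for this sketch; a real tool would obtain the program counter from a timer interrupt rather than from a list.

```python
from collections import Counter

def bucket_histogram(pc_samples, code_start, bucket_size):
    """Count how many sampled program-counter values fall in each bucket."""
    histogram = Counter()
    for pc in pc_samples:
        histogram[(pc - code_start) // bucket_size] += 1
    return histogram

# Hypothetical samples: one program-counter value per timer tick.
samples = [0x1000, 0x1004, 0x1010, 0x1044, 0x1048, 0x10F0]
hist = bucket_histogram(samples, code_start=0x1000, bucket_size=0x40)

for bucket in sorted(hist):
    print(f"bucket {bucket}: {hist[bucket]} samples")
```

Each sample is a snapshot; whatever the program did between two ticks is not represented in the histogram.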

[0012] The current performance analysis methods 
fail to provide customized collection or output of performance
data. Generally, performance tools only collect
a pre-specified set of data to display to a developer.

Summary of the Invention 

[0013] Methods, systems, and articles of manufacture
consistent with the present invention overcome the
shortcomings of the prior art by facilitating performance 
analysis of multi-threaded programs executing in a data 
processing system. Such methods, systems, and arti- 
cles of manufacture analyze performance of threads 
executing in a data processing system by receiving data 
reflecting a state of each thread executing during a 
measurement period, and displaying a performance 
level corresponding to the state of each thread during 
the measurement period. 

Brief Description of the Drawings 

[0014] The accompanying drawings, which are 
incorporated in and constitute a part of this specifica- 
tion, illustrate an implementation of the invention and, 
together with the description, serve to explain the 
advantages and principles of the invention. In the draw- 
ings, 

FIG. 1 depicts a data processing system suitable for 
implementing a performance analysis system con- 
sistent with the present invention; 
FIG. 2 depicts a block diagram of a performance 
analysis system operating in accordance with 
methods, systems, and articles of manufacture 
consistent with the present invention; 
FIG. 3 depicts a flow chart illustrating operations 
performed by a performance analysis system con- 
sistent with an implementation of the present inven- 
tion; and 

FIG. 4 depicts a multi-dimensional display of the 
performance data associated with an application 
program that has been instrumented in accordance 
with an implementation of the present invention. 

Detailed Description 

[0015] Reference will now be made in detail to an 
implementation consistent with the present invention as
illustrated in the accompanying drawings. Wherever
possible, the same reference numbers will be used
throughout the drawings and the following description to
refer to the same or like parts.

Overview 

[0016] Methods, systems, and articles of manufacture
consistent with the present invention utilize performance
data collected during execution of an
application program to illustrate graphically for the
developer performance data associated with the program.
The program is instrumented to generate the performance
data during execution. Each program thread
performs one or more operations, each operation
reflecting a different state of the thread. The performance
data may reflect an overall performance for each
thread as well as a performance level for each state
within a thread during execution. The developer can
specify the type and extent of performance data to be
collected. Given a graphical display of the performance
of all threads together, the developer can see
where to make any appropriate adjustments to improve
overall performance by better synchronizing operations
among the threads.

[0017] A performance analysis database access
language is used to instrument the program in a manner
consistent with the principles of the present invention.
Instrumentation can be done automatically using known
techniques that add instructions to programs at specific
locations within the programs, or manually by a developer.
The instructions may specify collection of performance
data from multiple system components; for
example, performance data may be collected from both
hardware and the operating system.
[0018] A four-dimensional display of performance
data includes information on threads, times, states, and
performance level. A performance analyzer also evaluates
quantitative expressions corresponding to performance
metrics specified by a developer, and displays the
computed value.

Performance Analysis System 

[0019] FIG. 1 depicts an exemplary data processing
system 100 suitable for practicing methods and systems
consistent with the present invention. Data processing
system 100 includes a computer system 105 connected
to a network 190, such as a Local Area Network, Wide
Area Network, or the Internet.

[0020] Computer system 105 contains a main
memory 130, a secondary storage device 140, a processor
150, an input device 170, and a video display 160.
These internal components exchange information with
one another via a system bus 165. The components are
standard in most computer systems suitable for use with
practicing methods and configuring systems consistent



EP 1 026 592 A2 



with the present invention. One such computer system 
is the SPARCstation from Sun Microsystems, Inc.
[0021] Although computer system 105 contains a
single processor, it will be apparent to those skilled in
the art that methods consistent with the present invention
operate equally well in a multi-processor environment.

[0022] Memory 130 includes a program 110 and a 
performance analyzer 115. Program 110 is a multi- 
threaded program. For purposes of facilitating perform- 
ance analysis of program 110 in a manner consistent 
with the principles of the present invention, the program 
is instrumented with appropriate instructions of the 
developer's choosing to generate certain performance 
data. 

[0023] Performance analyzer 115 comprises
two components. The first component 115a is a library
of functions to be performed in a manner specified by
the instrumented program. The second component
115b is a developer interface that is used for two functions:
(1) automatically instrumenting a program; and
(2) viewing performance information collected when an 
instrumented program is executed. 
[0024] As explained, instrumentation can be done 
automatically with the use of performance analyzer 
interface 115b. According to this approach, the devel- 
oper simply specifies for the analyzer the type of per- 
formance data to be collected and the analyzer adds the 
appropriate commands from the performance analysis 
database access language to the program in the appro- 
priate places. Techniques for automatic instrumentation 
in this manner are familiar to those skilled in the art. 
Alternatively, the developer may manually insert com- 
mands from the performance analysis database access 
language in the appropriate places in the program so 
that during execution specific performance data is 
recorded. The performance data generated during execution
of program 110 is recorded in memory, for example,
main memory 130.

[0025] Performance analyzer interface 115b permits
developers to view performance information corresponding
to the performance data recorded when
program 110 is executed. As explained below, the 
developer may interact with the analyzer to alter the 
view to display performance information in various con- 
figurations to observe different aspects of the program's 
performance without having to repeatedly execute the 
program to collect information for each view, provided 
the program was properly instrumented at the outset. 
Each view may show (i) a complete measurement cycle 
for one or more threads; (ii) when each thread enters 
and leaves each state; and (iii) selected performance 
criteria corresponding to each state. 
[0026] Although not shown in FIG. 1, like all computer
systems, system 105 has an operating system that
controls its operations, including the execution of pro- 
gram 110 by processor 150. Also, although aspects of 
one implementation consistent with the principles of the 
present invention are described herein with the performance
analyzer stored in main memory 130, one skilled
in the art will appreciate that all or part of systems and
methods consistent with the present invention may be
stored on or read from other computer-readable media,
such as secondary storage devices, like hard disks,
floppy disks, and CD-ROM; a carrier wave received
from the Internet; or other forms of ROM or RAM.
Finally, although specific components of data processing
system 100 have been described, one skilled in the
art will appreciate that a data processing system suitable
for use with the exemplary embodiment may contain
additional or different components.
[0027] FIG. 2 depicts a block diagram of a performance
analysis system consistent with the present invention.
As shown, program 210 consists of multiple
threads 212, 214, 216, and 218. Processor 220 executes
threads 212, 214, 216, and 218 in parallel. Memory
240 represents a shared memory that may be
accessed by all executing threads. A protocol for coordinating
access to a shared memory is described in U.S.
Patent Application Serial No. ________, of
Shaun Dennie, entitled "Protocol for Coordinating the
Distribution of Shared Memory", (Attorney Docket No.
06502-0207-00000), which is incorporated herein by
reference. Although a single processor 220 is shown,
multiple processors may be used to execute threads
212, 214, 216, and 218.

[0028] To facilitate parallel execution of multiple
threads 212, 214, 216, and 218, an operating system
partitions memory 240 into segments designated for
operations associated with each thread and initializes
the fields of each segment. For example, memory segment
245 comprises enter and exit state identifiers,
developer specified information, and thread identification
information. An enter state identifier stores data
corresponding to when, during execution, a thread
enters a particular state. Similarly, an exit state identifier
stores data corresponding to when, during execution of
an application program, a thread leaves a particular
state. Developer specified data represents the performance
analysis data collected.
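
By way of illustration only, the contents of one such memory segment might be modeled in Python as follows. The class name `StateRecord`, its field names, and the sample values are assumptions made for this sketch; the description above specifies only that a segment holds enter and exit state identifiers, developer specified information, and thread identification information.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional

@dataclass
class StateRecord:
    """One memory segment: which thread, which state, and when it entered and left."""
    thread_id: int                         # thread identification information
    state: str
    enter_time: float                      # enter state identifier
    exit_time: Optional[float] = None      # exit state identifier, unset until the thread leaves
    developer_data: Dict[str, Any] = field(default_factory=dict)

    def duration(self) -> Optional[float]:
        """Time spent in the state, or None while the thread is still in it."""
        if self.exit_time is None:
            return None
        return self.exit_time - self.enter_time

# Hypothetical usage: a thread enters a state, records a metric, then leaves.
record = StateRecord(thread_id=1, state="compute", enter_time=2.0)
record.developer_data["num_ops"] = 5_000_000
record.exit_time = 4.5
print(record.duration())
```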

[0029] A reserved area of memory 250 is used to
perform administrative memory management functions,
such as coordinating the distribution of shared memory
to competing threads. The reserved area of memory
250 is also used for assigning identification information
to threads using memory.

[0030] The flow chart of FIG. 3 provides additional
details regarding the operation of a performance analysis
system consistent with an implementation of the
present invention. Instructions that generate performance
data are inserted into a program (step 305). The
instrumented program is executed and the performance
data are generated (steps 310 and 315). In response to
a request to view performance data, the performance analyzer
accesses and displays the performance data (step
320).
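
By way of illustration only, the steps of FIG. 3 might be sketched in Python as follows. The context manager `state`, the shared list `records`, and the `worker` function are hypothetical stand-ins for the inserted instructions, the recorded performance data, and the instrumented program; they are not the performance analysis database access language referred to above.

```python
import threading
import time
from contextlib import contextmanager

records = []                      # shared performance data (hypothetical layout)
records_lock = threading.Lock()   # coordinates access by competing threads

@contextmanager
def state(name):
    """Inserted instruction pair: note when the current thread enters and leaves a state."""
    enter = time.perf_counter()
    try:
        yield
    finally:
        leave = time.perf_counter()
        with records_lock:
            records.append((threading.current_thread().name, name, enter, leave))

def worker():
    # Step 305: the program is instrumented with state markers.
    with state("compute"):
        sum(i * i for i in range(10_000))
    with state("output"):
        time.sleep(0.01)

# Steps 310 and 315: execute the instrumented program and generate the data.
threads = [threading.Thread(target=worker, name=f"thread-{i}") for i in (1, 2)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Step 320: access and display the recorded data on request.
for thread_name, state_name, enter, leave in sorted(records):
    print(f"{thread_name} {state_name}: {leave - enter:.6f} s")
```

Because the data are kept after execution, the final loop could be replaced by any other view of `records` without running the program again.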
[0031] The performance analyzer is capable of displaying
both the performance data and the related source
code and assembly code, i.e., machine instructions, 
corresponding to the data. This allows a developer to 
relate performance data to both the source code and 
the assembly code that produced the data. 
[0032] FIG. 4 shows a display 400 with two parts 
labeled A and B, respectively. The first part, labeled A, 
shows the performance characteristics of an application 
program in four dimensions: threads, time, states, and 
performance. Performance information for each thread 
is displayed horizontally using a bar graph-type format. 
Time is represented on the horizontal axis; performance 
is represented on the vertical axis. 
[0033] Two threads, thread 1 and thread 2 in display 
400, were executing concurrently. As shown, the 
threads began executing at different times. The horizon- 
tal axis for thread 1 is labeled 402. Thread 1 began exe- 
cuting at a point in time labeled "x" on the horizontal axis 
402. The horizontal axis for thread 2 is labeled 404. 
Thread 2 began executing at time "b". Each thread per- 
formed operations in multiple states, each state being 
represented by a different pattern. Thread 2 was idle at 
the beginning of the measuring period. One reason for 
this idle period may be that thread 2 was waiting for 
resources from thread 1. Based on this information, a
developer can allocate operations of a thread among 
states such that performance will be improved, for 
example, by not executing concurrent operations that 
require use of the same system resources. 
[0034] As shown, thread 1 entered state 410 at a 
point in time "x" on the horizontal axis 402 and left state 
410 at time "y", and entered state 420 at time "m" and
left state 420 at time "n". The horizontal distance
between points "x" and "y" is shorter than the horizontal
distance between points "m" and "n". Therefore, thread
1 operated in state 420 longer than it operated in state
410. The vertical height of each bar shows a level of performance.
The vertical height for state 410 is lower than
the vertical height for state 420, showing that states 410
and 420 operated at different levels of performance. The
change in vertical height as an executing thread transitions
from one state to another corresponds to changes
in performance level. This information may be used to
identify the effect of transitioning between consecutive
states on performance, and directs a developer to areas
of the program for making changes to increase perform- 
ance. 
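
By way of illustration only, the quantities read off the display for thread 1 might be computed in Python as follows. The numeric values assigned to the times "x", "y", "m", and "n" and to the bar heights are invented for this sketch, as are the function names `durations` and `level_changes`.

```python
# Hypothetical values for the times "x", "y", "m", "n" and for the bar heights.
thread_1 = [
    {"state": 410, "enter": 0.0, "leave": 2.0, "level": 30.0},   # x .. y
    {"state": 420, "enter": 2.0, "leave": 7.0, "level": 55.0},   # m .. n
]

def durations(intervals):
    """Horizontal extent of each bar: time spent in each state."""
    return {iv["state"]: iv["leave"] - iv["enter"] for iv in intervals}

def level_changes(intervals):
    """Change in bar height at each transition between consecutive states."""
    return [
        (a["state"], b["state"], b["level"] - a["level"])
        for a, b in zip(intervals, intervals[1:])
    ]

time_in_state = durations(thread_1)
transitions = level_changes(thread_1)
print(time_in_state)
print(transitions)
```

With these assumed values, state 420 lasts longer than state 410 and the performance level rises at the transition, matching the relationships described for the display.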

[0035] The bottom half of the display, labeled B,
illustrates an expression evaluation feature of the per- 
formance analyzer's interface. A developer specifies 
computational expressions related to a performance 
metric of a selected state(s). The performance analyzer 
computes the value of an expression for the perform- 
ance data collected. 

[0036] In the example shown, the developer has
selected state 440. The expression on the first line,
"NUM.OPS/(1000000*TIME)", is an expression for computing
the number (in millions) of floating point instructions
per second (MFLOPS). The expression on the
second line, "2*CPU.MHZ", calculates a theoretical
peak level of performance for a specified state. The performance
analyzer may evaluate these two expressions in
conjunction to provide quantitative information about a
particular state. For example, by dividing MFLOPS by
the theoretical peak performance level for state 440,
the performance analyzer calculates for the developer the
percentage of theoretical peak represented by each
operation in state 440.
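
By way of illustration only, the two expressions and their ratio might be evaluated in Python as follows. The sketch assumes a divisor of one million, as the phrase "(in millions)" implies, and assumes that the operation count, the elapsed time in seconds, and the clock rate in MHz are available as plain numbers. The function names and the sample measurements are hypothetical.

```python
def mflops(num_ops, time_s):
    """Millions of floating point instructions per second."""
    return num_ops / (1_000_000 * time_s)

def theoretical_peak(cpu_mhz):
    """Peak level from the second expression: twice the clock rate in MHz."""
    return 2 * cpu_mhz

def percent_of_peak(num_ops, time_s, cpu_mhz):
    """Achieved MFLOPS expressed as a percentage of the theoretical peak."""
    return 100.0 * mflops(num_ops, time_s) / theoretical_peak(cpu_mhz)

# Hypothetical measurements for one state.
achieved = mflops(num_ops=150_000_000, time_s=2.0)
peak = theoretical_peak(cpu_mhz=300)
share = percent_of_peak(150_000_000, 2.0, 300)
print(achieved, peak, share)
```

With these assumed measurements the state achieves 75 MFLOPS against a peak of 600, or 12.5 percent of the theoretical peak.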

Conclusion 

[0037] Methods and systems consistent with the
present invention collect performance data from hardware
and software components of an application program,
allowing a developer to understand how
performance data relates to each thread of a program
and complementing a developer's ability to understand
and subsequently diagnose performance issues occurring
in a program.

[0038] Although the foregoing has been
described with reference to a specific implementation,
those skilled in the art will know of various changes in
form and detail which may be made without departing
from the spirit and scope of the present invention as
defined in the appended claims and the full scope of
their equivalents.

Claims 

1. A method for analyzing performance of threads
executing in a data processing system, the method
comprising:

receiving data reflecting a state of each thread
executing during a measuring period; and
displaying a performance level corresponding
to the state of each thread during the measuring
period.

2. The method of claim 1 wherein receiving includes:

designating the data to be collected during the
measuring period.

3. The method of claim 1 wherein displaying includes:

accessing a data file created during program
execution.

4. A method for analyzing performance of threads
executing in a data processing system, the method
comprising:

inserting instructions in a program that generate
data reflecting a performance level while a
program is executing;
executing the program; and
displaying the performance data according to
the state of a thread operating in the program
during a measuring period.

5. A method for analyzing performance of threads
executing in a data processing system, the method
comprising:

receiving a program instrumented with instructions
that generate performance data;
executing the program and collecting the performance
data; and
displaying a performance level corresponding
to the state of each thread during a measurement
period.

6. A method for analyzing performance of threads
executing in a multi-processor computing system,
the method comprising:

designating a particular time interval of an executing
program to collect performance data;
and
displaying data corresponding to a performance
level generated during execution of the
program.

7. A system for collecting performance data associated
with threads executing in a multi-processor
environment including:

an expandable set of commands for designating
collection of performance data related to a
thread's execution in a particular state; and
a storage area for storing the performance data
collected.

8. A system for displaying performance data associated
with threads executing in a multi-processor
environment including:

a set of performance data generated from an
instrumented program; and
a display of a performance level of each thread
corresponding to the performance data.

9. A method for analyzing performance of an application
program executing in a data processing system,
wherein the application program is executed
using multiple threads and each executing thread
passes through multiple states, the method comprising:

reading data corresponding to a predetermined
performance-related aspect of the application
program for more than one state of each thread
during a measuring period; and
displaying a performance level corresponding
to the predetermined performance-related
aspect of the application program and reflecting
changes in performance at every point in
time in each state.